{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Trust Region Policy Optimization\n",
    "\n",
    "Trust region policy optimization shortly known as TRPO is one of the most popularly used algorithms in deep reinforcement learning. TRPO is a policy gradient algorithm and it acts as an improvement to the policy gradient with baseline we learned in chapter 8. We learned that policy gradient is an on-policy algorithm meaning that on every iteration, we improve the same policy with which we are generating trajectories. On every iteration, we update the parameter of our network and try to find the improved policy. The update rule for updating the parameter $\\theta$ of our network is given as follows:\n",
    "\n",
    "$$\\theta = \\theta + \\alpha \\nabla_{\\theta} J({\\theta}) $$\n",
    "\n",
    "Where $\\nabla_{\\theta} J({\\theta}) $ is the gradient and $\\alpha$ is known as the step size or learning rate. If the step size is large then there will be a large policy update and if it is small then there will be a small update in the policy. How can we find an optimal step size? In the policy gradient method, we keep the step size small and so on every iteration there will be a small improvement in the policy.\n",
    "\n",
    "\n",
    "But what happens if we take a large step on every iteration? Let's suppose we have a policy $\\pi$ parameterized by $\\theta$. So, on every iteration updating $\\theta$ implies that we are improving our policy. If the step size is large then the policy on every iteration varies greatly, that is, the old policy (policy used in the previous iteration) and the new policy (policy used in the current iteration) vary greatly. \n",
    "\n",
    "\n",
    "We learned that if the step size large is then the new policy and old policy will vary greatly. Since we are using parametrized policy, it implies that if we make a large update (large step size) then the parameter of old policy and new policy very heavily and this leads to a problem called model collapse.\n",
    "\n",
    "This is the reason, in the policy gradient method, instead of taking larger steps and update the parameter of our network we take a small step and update the parameter to keep the old policy and new policy closer. But how can we improve this? Can we take a larger step along with maintaining the old policy and new policy closer so that it won't affect our model performance and it will also help us to learn quickly? Yes, this problem is exactly solved by TRPO.\n",
    "\n",
    "TRPO tries to make a large policy update while imposing a constraint that old policy and new policy should not vary too much. Okay, what is this constraint? But first, how can we measure and understand if the old policy and new policy are changing greatly? Here is where we use a measure called KL divergence. The KL divergence is ubiquitous in reinforcement learning. It tells us how two probability distributions are different from each other. So, we can use the KL divergence to understand if our old policy and new policy varies greatly or not. TRPO adds a constraint that the KL divergence between the old policy and new policy should be less than or equal to some constant $\\delta$. That is, when we make a policy update, old policy and a new policy should not vary more than some constant. This constraint is called trust region constraint. \n",
    "\n",
    "Thus, TRPO tries to make a large policy update while imposing the constraint that the parameter of the old policy and a new policy should be within the trust region. Note that in the policy gradient method, we use the parameterized policy. Thus, keeping the parameter of the old policy and new policy within the trust region implies that the old policy and new policy is within the trust region.\n",
    "\n",
    "TRPO guarantees monotonic policy improvement, that is, it guarantees that there will always be a policy improvement on every iteration. This is the fundamental idea behind the TRPO algorithm. \n",
    "\n",
    "To understand how exactly TRPO works, we should understand the math behind TRPO. TRPO has pretty heavy math. But worry not! It will be simple if we understand the fundamental math concepts required to understand TRPO. So, before diving into the TRPO algorithm, first, we will understand several essential math concepts that are required to understand TRPO. Then we will learn how to design TRPO objective function with the trust region constraint and in the end, we will see how to solve the TRPO objective function. \n",
    "\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
